install.packages(c("rio", "tidyverse", "janitor", "lubridate", "rmarkdown", "fs"))Nomilo Fishpond Biogoechemical Analysis
Throughout this document, hover over the numbered annotations to the right of code chunks to reveal detailed explanations and comments about the code. Where drop-down italicized text is present, expand by pressing on arrow to see code.
Install Packages
Load Libraries
library(rio)
library(tidyverse)
library(janitor)
library(lubridate)
library(rmarkdown)
library(fs)- 1
- For importing excel data
- 2
- For cleaning of data
- 3
- For cleaning variable names
- 4
- For cleaning dates
- 5
- For displaying tables
- 6
- For file path usage
Import Raw Data
Procedure
Define vector of files to import:
files_to_import <- dir_ls(path = "data/raw")
for (i in seq_along(files_to_import)) {
cat(i, "= ", files_to_import[i], "\n")
}- 1
-
Store the file paths of our raw data within the
data/rawdirectory infiles_to_import - 2
- Print each file path with its index
1 = data/raw/2024-02-28_dfs.RData
2 = data/raw/2024-02-28_ksf-clam-growth.xlsx
3 = data/raw/2024-02-28_ksf-compiled-data.xlsx
4 = data/raw/2024-02-28_ksf-oyster-cylinder-growth.xlsx
5 = data/raw/2024-02-28_profile-data.xlsx
6 = data/raw/2024-02-28_water-samples.xlsx
7 = data/raw/2024-02-28_weather-data.xlsx
8 = data/raw/2024-03-01_dfs-no-profiles.RData
9 = data/raw/2024-03-01_dfs_no_profiles.RData
10 = data/raw/~$2024-02-28_profile-data.xlsx
11 = data/raw/~$2024-02-28_water-samples.xlsx
Use the purrr::map() function to iteratively import files in the files_to_import vector except for the profiles data and .RData files:
The @iteratively-import-raw-data code chunk should only be ran once when raw data is updated because it takes long to execute. Therefore, run the @efficiently-load-raw-data code chunk instead to easily import up-to-date raw data.
dfs_no_profiles <- map(files_to_import[c(2:4, 6, 7)], import_list)
current_date <- format(Sys.Date(), "%Y-%m-%d")
save(dfs_no_profiles, file = paste0("data/raw/", current_date, "_dfs-no-profiles.RData"))Efficiently import up-to-date raw data:
load(files_to_import[8])Rename datasets:
We will always use snakecase when naming our data objects and functions (e.g., data_object_name or function_name()).
names(dfs_no_profiles) <- gsub("data/raw/2024-02-28_|\\.xlsx$|\\.xls$", "",
files_to_import[c(2:4, 6, 7)])
names(dfs_no_profiles) <- gsub("-", "_", names(dfs_no_profiles))
names(dfs_no_profiles)- 1
- Remove prefixes and file extensions
- 2
- Replace hyphens with underscores
- 3
- Check if names were outputted correctly
[1] "ksf_clam_growth" "ksf_compiled_data"
[3] "ksf_oyster_cylinder_growth" "water_samples"
[5] "weather_data"
Rename each sheet within each raw dataset to be lowercased and replace spaces with underscores:
dfs_no_profiles <- map(dfs_no_profiles, ~ set_names(.x, gsub(" ", "_", tolower(names(.x)))))Create separate datasets by specifying the Excel sheet from each spreadsheet we want to tidy:
ksf_clam_growth_data <- dfs_no_profiles$ksf_clam_growth$sheet1
ksf_compiled_data <- dfs_no_profiles$ksf_compiled_data$full_data
ksf_oyster_cylinder_growth_data <- dfs_no_profiles$ksf_oyster_cylinder_growth$sheet1
water_samples_data <- dfs_no_profiles$water_samples$data_overview
weather_data <- dfs_no_profiles$weather_data$weather_ksfWe want to combine multiple sheets within the profiles Excel spreadsheet into one, therefore, we will import it separately:
sheets_to_import <- c("L1", "L2", "L3", "L4")
profiles_data <- profiles_data <- map_dfr(sheets_to_import, function(sheet_name) {
import(files_to_import[5], which = sheet_name)
}) %>%
bind_rows()- 1
- [code annotation]
- 2
- [code annotation]
- 3
- [code annotation]
View Raw Data
Tidy Raw Data
Tidying Processes
Steps to clean data
new_var_names <- c(
"sort_date", "color", "clams_in_count", "clams_in_lbs", "clams_in_avg_per_lb",
"clams_out_count", "clams_out_lbs", "clams_out_avg_per_lb", "growth_in_lbs",
"growth_pct", "sr", "days_btwn_sort"
)
new_date_col <- c(
"2023-10-17", "2023-12-06", "2023-12-12", "2024-01-02", "2024-01-10", "2024-01-24",
"2024-01-31", "2024-02-08", "2024-02-13"
)
new_date_col <- c("2023-10-17", "2023-12-06", "2023-12-12", "2024-01-02", "2024-01-10", "2024-01-24", "2024-01-31", "2024-02-08", "2024-02-13")
ksf_clam_growth_data_tidied <- ksf_clam_growth_data %>%
slice(-1) %>%
setNames(new_var_names) %>%
mutate(date = as.Date(new_date_col)) %>%
dplyr::select(-sort_date) %>%
pivot_longer(
cols = c(
clams_in_count, clams_in_lbs, clams_in_avg_per_lb, clams_out_count,
clams_out_lbs, clams_out_avg_per_lb
),
names_to = c("stage", ".value"),
names_prefix = "clams_",
names_sep = "_",
values_to = "value"
) %>%
mutate(stage = if_else(str_detect(stage, "in"), "In", "Out")) %>%
rename(avg_per_lbs = avg) %>%
mutate(across(c(color, stage), as.factor)) %>%
mutate(across(c(count, lbs, avg_per_lbs, growth_in_lbs, growth_pct, sr),
~as.numeric(gsub("%", "", .)))) %>%
arrange(date, color, stage) %>%
dplyr::select(date, days_btwn_sort, color, stage, count, lbs, avg_per_lbs,
growth_in_lbs, growth_pct, sr)
paged_table(ksf_clam_growth_data_tidied)Steps to clean data
ksf_compiled_data_tidied <- ksf_compiled_data %>%
rename_with(~gsub("\\s*\\([^\\)]+\\)", "", .x)) %>%
janitor::clean_names() %>%
rename(date = date_time) %>%
mutate(date = as.Date(date)) %>%
filter(date >= as.Date("2023-11-20") & date <= as.Date("2024-02-20")) %>%
arrange(date) %>%
dplyr::select(-c(external_voltage, wk_num, wind_dir,
spadd, outdoor_temperature, hourly_rain,
solar_radiation, resistivity, battery_capacity,
hour, daynum, data_pt, wind_sp, diradd,
wind_speed, wind_direction, tide, day, month, year)
) %>%
dplyr::select(where(~ !anyNA(.))) %>%
group_by(date) %>%
summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
paged_table(ksf_compiled_data_tidied)- 1
- Clean variable names by removing everything in parentheses, using lowercase and underscores in place of spaces
- 2
-
Rename the
date_timevariable todate, filter to desired date range and sort bydate - 3
- Remove unnecessary variables
- 4
- Remove columns with containing all NA values
- 5
-
Group by
dateand calculate the average of every variable for each day
Steps to clean data
oyster_var_names <- c(
"date", "oyster_large_weight", "oyster_large_gain", "oyster_small_weight",
"oyster_small_gain", "oyster_chlorophyll"
)
ksf_oyster_cylinder_growth_data_tidied <- ksf_oyster_cylinder_growth_data %>%
dplyr::select(c(1, 4, 5, 8, 9, 12)) %>%
slice(-1) %>%
setNames(oyster_var_names) %>%
pivot_longer(
cols = c(oyster_large_weight, oyster_large_gain,
oyster_small_gain,
oyster_small_weight),
names_to = c("oyster_size", ".value"),
names_prefix = "oyster_",
names_sep = "_",
values_to = "value"
) %>%
mutate(oyster_size = if_else(str_detect(oyster_size, "small"), "Small", "Large")) %>%
mutate(oyster_size = as.factor(oyster_size),
across(c(weight, gain), as.numeric)
) %>%
filter(date >= as.Date("2023-11-20") & date <=
as.Date("2024-02-14"))
paged_table(ksf_oyster_cylinder_growth_data_tidied)- 1
- Manually set variable names
- 2
- Select desired columns and remove first row
- 3
- Convert from wide to long format
- 4
- Create a new variable that differentiates oyster size
- 5
- Adjust data types to numeric and factor
- 6
- Filter to desired date range
Address the NA values before merge
Steps to clean data
water_samples_data_tidied <- water_samples_data %>%
slice(-c(44:52)) %>%
rename_with(~gsub("\\s*\\([^\\)]+\\)", "", .x)) %>%
janitor::clean_names() %>%
mutate(
date = if_else(date == "44074",
as.character(as.Date("2024-01-09")),
format(dmy(date), "%Y-%m-%d"))
)
paged_table(water_samples_data_tidied)Done: - Date format - Date number conversion
The reason why it wasn’t working was because the date column was of a character data type, therefore we can easily convert 44074 but we were assuming that R knew the other dates were of a date type. So to fix this, we first needed to convert it into a date in the original format of DD-MM-YYYY, then convert it to our desired format of YYYY-MM-DD.
Need to do: - Names clean up - Add numbers to sample ID
Steps to clean data
weather_data_tidied <- weather_data
paged_table(weather_data_tidied)Steps to clean data
profiles_data_tidied <- profiles_data
paged_table(profiles_data_tidied)Merge Tidied Datasets
Export Tidied Datasets
Export tidied datasets to CSV into data/tidied folder:
source("code/functions/export_to_csv.R")
dfs_to_export <- list(
ksf_clam_growth_data_tidied = ksf_clam_growth_data_tidied,
ksf_compiled_data_tidied = ksf_compiled_data_tidied,
ksf_oyster_cylinder_growth_data_tidied = ksf_oyster_cylinder_growth_data_tidied
)
imap(dfs_to_export, ~ export_to_csv(.x, .y, "data/tidied"))- 1
- List of dataframes we want to export as CSV files
- 2
-
Iterate the
export_to_csv(df, df_name, dir_path)function over each dataframe..xrefers to the dataframe..yrefers to the name of the dataframe. These are passed toexport_to_csv()function along with the desired directory path.
Export merged final data set into data/outputs folder.